About
About the Data Science Accelerator
The Data Science Accelerator is a capability-building programme which gives analysts from across the public sector the opportunity to develop their data science skills.
For this project, I will be guided by an experienced data scientist from the Ministry of Justice and based at the GDS hub in Whitechapel once a week between December 2018 and March 2019. As part of this accelerator, I aim to explore data science techniques to produce meaningful and helpful insights both for other teams within the Greater London Authority and for colleagues at participating London boroughs.
About the voter registration project
The main objective of the project is to use machine learning techniques to spot patterns and trends within the data that we would not detect with the more traditional methods usually associated with this kind of data. The focus, for now, is on understanding the data. We hope that this will lead to insights, recommendations, ideas and the confidence to pursue better-informed campaigns to maximise voter registration in the capital.
Please be aware that this project is exploratory in nature, so nothing found on these pages is carved in stone. If you have been sent a link to this page, we assume you have some kind of interest, so we would welcome your views.
Boroughs included in this analysis
- Lewisham
- Brent
- Wandsworth
- Waltham Forest
- Greenwich
- Lambeth
- Croydon
Details about this site and different types of analysis
This site is the output of this project. Anything I learn will be added to these pages. If any of it is unclear or difficult to understand, please do get in touch and I will try to update it accordingly. This site is hosted on GitHub and written in R Markdown, using the flexdashboard and crosstalk packages for functionality.
The main way I have attempted to communicate my results is with storyboard pages, with navigation in large squares along the top of the page. These can result in very large HTML files that take a long time to load, so I have created a separate storyboard for each of the different types of analysis, as detailed below.
General information: details about the project, its background, stated aims and its data sources
Clustering: part of the data exploration, where I look for distinct groupings within the data
Predictive modelling: Statistical models that can predict which areas will have lower rates of voter registration. Here I use a particular type of model called a ‘decision tree’ that is very strong at explaining relationships between variables.
Icon made by Freepik from www.flaticon.com
Data
Main sources of data
Our starting point for this project is, for each borough, an anonymised list of addresses from the electoral roll. We only had data on the number of people registered at each of these addresses. We have merged in other data sets, including:
- ONS population estimates: Mid-year population estimates, that are broken down by output area and age
- UKBuildings database: This innovative dataset combines categorisations of building type and age derived from aerial satellite photography, supplemented with local authority information.
- Indices of deprivation: Provided by the Ministry of Housing, Communities and Local Government, this data set provides a range of different indicators of deprivation
- Census data: A wealth of data going back to the 2011 Census. Of particular importance for this project is data on the tenure, education and ethnicity of residents
- Energy Performance Certificates (EPC)
- Land Registry
- Consumer Data Research Centre (UCL): Research centre based at UCL that gathers and produces data on many different topics. For this project, we have used up-to-date data on population churn and the ethnic makeup of neighbourhoods.
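As a rough illustration of how these sources are combined, the sketch below joins two toy OA-level data sets on an output area code. The column names and values here are invented for the example, not the actual variables used in the project.

```r
# Sketch: merging OA-level data sets on an output area code.
# All column names and values below are illustrative only.
library(dplyr)

electoral <- data.frame(oa_code    = c("E00016136", "E00016137"),
                        registered = c(310, 275),
                        addresses  = c(190, 160))

imd <- data.frame(oa_code    = c("E00016136", "E00016137"),
                  imd_decile = c(3, 6))

oa_data <- electoral %>%
  left_join(imd, by = "oa_code") %>%                     # match rows by OA code
  mutate(reg_per_address = registered / addresses)       # simple derived rate
```

The same pattern (a `left_join` keyed on the OA code) extends naturally to each of the other sources listed above.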
Geographic area
- The data was aggregated to Census Output Area (OA) level:
- Typically a population of around 500 people
- There are 886 OAs in Lewisham
Details about how I have calculated voter registration
This hasn’t been as simple as I had hoped it would be. We often don’t
Clustering
Aims
Aims of clustering
Unsupervised learning consists of statistical methods that extract meaning from data without categorising any of the data beforehand. In this sense, unsupervised learning expects the machine to decide what the categories are. Clustering can be described as ‘the art of finding groups in data’. Classifying similar objects into groups is an important and natural human activity and a prominent part of science, history, government, education and marketing. The main reasons for using this kind of technique are:
- to identify distinct groups within a data set
- as an extension of exploratory data analysis
- to gain insight into the data and how variables relate to one another
Given that OAs are purely administrative inventions, can we see meaningful groups emerge? Or is the data too noisy? The aim of this part of the analysis is to run, explain and measure the performance of building clusters on our data set. I have focused this part of the analysis on building age and type.
Buildings Data
At the moment, this data is only for Lewisham Council. I have run the clustering algorithm on buildings data, which has been split into categories for type and age.
Type: Flat block | Converted Flats | House (detached/semi-detached) | Terraced house
Age: Victorian/pre-WW1 | Interwar | Postwar (1945-1979) | Modern (1980-)
For each OA, the percentage of addresses that fall into each of the above categories is calculated. The clusters are based on these variables. What we are looking for here is whether or not OAs fit into neat categories of building types.
Algorithm
This analysis is carried out using k-means clustering, where we specify the number of clusters, K, in advance and the algorithm finds that many groups in the data. The algorithm divides the data into K clusters and finds centres that minimise the sum of the squared distances from each data point to its nearest centre.
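A minimal sketch of this in R, using base R's `kmeans` on made-up OA-level building percentages (the variable names here are illustrative, not the actual categories from the UKBuildings data):

```r
# Sketch: k-means on toy OA-level building-mix percentages.
set.seed(42)  # k-means starts from random centres, so fix the seed

# toy data: share of addresses in each (illustrative) building category per OA
oa_buildings <- data.frame(
  pct_victorian_terrace = runif(100),
  pct_postwar_flats     = runif(100),
  pct_modern_flats      = runif(100)
)

# K = 5 as in the analysis; nstart repeats with different random centres
km <- kmeans(oa_buildings, centers = 5, nstart = 25)

table(km$cluster)   # number of OAs assigned to each cluster
km$centers          # cluster centres in the original variable space
```

`nstart = 25` simply reruns the algorithm from 25 random starting configurations and keeps the best result, which makes the clustering less sensitive to the random initialisation.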
Main findings
Five clusters of building types seemed to emerge from the data, separating out:
- Victorian terraced housing, often converted into flats
- High proportions of flat blocks built in the postwar period (1945-79)
- Terraced houses built between the wars
- Modern flat blocks
- Areas with more houses, mostly interwar
The two interwar categories have the highest levels of voter registration. The modern flat blocks have the lowest; the combination of higher-than-average deprivation, a younger population and more renters may help explain this.
The clustering here is not as well defined as we thought it might be. We think there are two main reasons for this:
- A lot of development has been quite local, so large areas with the same type of housing are not very widespread
- OA boundaries are quite arbitrary, drawn up for administrative purposes rather than to reflect any kind of neighbourhood, which makes this geographic unit arguably unsuitable for clustering. Looking at streets might be better.
Frequency
Labelling the clusters
The algorithm produced the clusters, which I have named according to their characteristics (see the next page to explore them yourself). It should be noted that a variety of different housing can be found in each of the clusters. I have named them according to how prominent different types/ages of housing are compared to their average share.
Frequency
| Cluster | Label | Frequency |
|---------|-------|-----------|
| 1 | Postwar flats | 202 |
| 2 | Interwar houses | 67 |
| 3 | Modern flats | 102 |
| 4 | Interwar terraces | 131 |
| 5 | Victorian converted | 382 |
Relationship with voter registration
The distribution of voter registration for each of the clusters is shown below. In these box plots the interquartile range is represented by the box, the median is shown as the line in the middle and the separate dots are the outliers. This shows that the ‘modern flats’ cluster has the lowest levels and the two interwar clusters have the highest.
Viz
Select one of the named clusters to see the distribution of housing type, housing age, deprivation, average age and renting for each of the clusters.
In the top chart, the percentages of housing types and ages are shown. If the colour of a bar is red, the value is much lower than the average for the borough; the greener it is, the further above the average. For example, if the bar for flat blocks is bright green, it is much higher than the average; if it is bright red, it is much lower; and the closer it is to brown, the closer it is to the average. This allows you to see the actual values in the chart while still being able to understand how they compare to the whole borough.
In the bottom chart, the categories have different scales (the IMD decile runs from 1 to 10, average age is usually between 30 and 70, and the missing and private-rented variables are percentages), so I have presented these as scaled values, in units of standard deviations.
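This standardisation is what base R's `scale()` does: subtract the mean and divide by the standard deviation, putting every variable on the same footing. A small sketch with made-up values:

```r
# Sketch: standardising a variable to z-scores with scale().
x <- c(30, 45, 70)        # e.g. some average ages on their natural scale
z <- scale(x)             # z = (x - mean(x)) / sd(x)

as.numeric(z)             # each value expressed in standard deviations
mean(z)                   # ~0 after centring
sd(as.numeric(z))         # 1 after scaling
```

After this transformation a value of +1 means "one standard deviation above the borough average", regardless of whether the underlying variable was a decile, an age or a percentage.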
The most common category has the highest number of missing values in the buildings data set. This is largely because houses that have been converted into flats are the least likely to be kept up to date.
Dashboard
The Modern flats OAs are most frequent close to the river, north of the New Cross Road. Interwar terraces and houses are found more in the south-east, in places like Bellingham and Downham; not many of the areas in this category are north of the A205. Many of the postwar flats OAs are in the south-west, in areas like Sydenham and Perry Vale. The Victorian terraces are found in the centre.
It is worth noting that each of the clusters can be found throughout and there are no obvious/striking patterns.
Silhouette
About this chart
This ‘silhouette chart’ represents how well the clustering model works. In very well clustered data, none of the bars would go below the red dotted line (and certainly not below the 0 line!)
This chart shows that most of the data fits quite well into the five clusters, but a lot of it doesn’t. As we know, the data is very noisy and OAs are not ideal units for clustering. The message from this chart is that quite a few areas are so mixed that they just cannot be categorised easily.
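A chart like this can be produced with the `cluster` package's `silhouette()` function. The sketch below assumes a k-means fit (`km`) on a small toy matrix (`dat`); each bar in the plot is one data point, and negative silhouette widths flag points that sit closer to a neighbouring cluster than to their own.

```r
# Sketch: silhouette diagnostics for a k-means fit (toy data).
library(cluster)

set.seed(1)
dat <- matrix(rnorm(60), ncol = 2)              # 30 toy points in 2D
km  <- kmeans(dat, centers = 3, nstart = 10)

sil <- silhouette(km$cluster, dist(dat))        # one row per point
mean(sil[, "sil_width"])                        # average silhouette width
plot(sil)                                       # the silhouette chart itself
```

The average silhouette width gives a single summary number: values near 1 indicate tight, well-separated clusters, while values near 0 (or negative) indicate the kind of mixed, hard-to-categorise areas described above.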

Predictive Modelling
Aims
What I am trying to achieve with this analysis is to produce models that, based on the data we have collected, can accurately predict voter registration at OA level and/or address level.
I will be using three models to do this:
- Linear/logistic regression
- Decision tree
- Random forest
The most interesting outputs from each of these models will be the way in which they come to their predictions, as much as the predictions themselves.
For the regression models, this will mean explaining which variables are most important for predicting voter registration. By using the decision tree, I should be able to explain combinations of values between different variables as well as estimating thresholds. The random forest model is built from many decision trees, and should be more accurate (though less straightforward to explain).
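For the tree-based models, a common R setup uses the `rpart` and `randomForest` packages. The sketch below fits both to invented OA-level data; the outcome and predictor names are illustrative placeholders, not the project's actual variables.

```r
# Sketch: a decision tree and a random forest on toy OA-level data.
# Variable names below are placeholders, not the project's real columns.
library(rpart)
library(randomForest)

set.seed(7)
oa <- data.frame(
  reg_rate    = runif(200, 0.5, 1),   # toy registration rate per OA
  pct_renters = runif(200),           # toy share of private renters
  avg_age     = rnorm(200, 40, 8)     # toy average resident age
)

tree <- rpart(reg_rate ~ pct_renters + avg_age, data = oa)
rf   <- randomForest(reg_rate ~ pct_renters + avg_age, data = oa)

printcp(tree)      # the splits and thresholds the tree chose
importance(rf)     # which variables the forest relies on most
```

The tree's printed splits are what make it "very strong at explaining relationships": each split is a readable rule with an explicit threshold, whereas the forest trades that readability for accuracy and only reports aggregate variable importance.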
Linear regression
Software/Code
This has been carried out entirely using the R programming language. For all of the code, please see the GitHub page.
I will include some snippets of code here to explain how I ran some of the models.